with numerical summaries, and with hypothesis tests (less good)
Summarizing Categorical Data
Our R Packages
library(broom)
library(Epi)        ## for twoby2() function
library(gt)         ## for making tables
library(gtExtras)   ## for fancier tables
library(kableExtra) ## for kbl() function
library(janitor)
library(naniar)
library(patchwork)
library(tidyverse)  # always load tidyverse last

theme_set(theme_test())  # trying a new theme
knitr::opts_chunk$set(comment = NA)
I used #| message: false in the code chunk to silence messages about conflicts between R packages.
1000 simulated adults (ages 31-75) living with diabetes in Cuyahoga County, each in one of four race/ethnicity categories and one of four insurance categories.
One new variable as compared to dm431: residence (Cleveland or Suburbs)
dm1000 Code Book (1 of 3)
| Variable  | Description |
|-----------|-------------|
| subject   | subject code (M-0001 through M-1000) |
| age       | subject's age, in years |
| insurance | primary insurance, 4 levels |
| n_income  | neighborhood median income, in $ |
| ht        | height, in meters (2 decimal places) |
| wt        | weight, in kilograms (2 decimal places) |
dm1000 Code Book (2 of 3)
| Variable | Description |
|----------|-------------|
| sbp      | most recent systolic blood pressure (mm Hg) |
| dbp      | most recent diastolic blood pressure (mm Hg) |
| a1c      | most recent Hemoglobin A1c (%) |
| ldl      | most recent LDL cholesterol level (mg/dl) |
| tobacco  | most recent tobacco status, 3 levels |
| statin   | 1 = prescribed a statin in past 12m, 0 = not |
dm1000 Code Book (3 of 3)

| Variable | Description |
|----------|-------------|
| eye_exam | 1 = diabetic eye exam in past 12m, 0 = no record of exam in past 12m |
If you are providing a data summary, then you should summarize the complete cases, and specify the number of missing values.
If you are intending to use the sample you’ve collected to make an inference about a process or population or to build a model, then you may want to consider whether or not a complete-case analysis will introduce bias.
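One quick way to quantify that missingness is with the naniar package loaded above; a minimal sketch, assuming dm1000 is in your session:

```r
## count and rank missing values by variable (naniar)
miss_var_summary(dm1000)

## how many rows are complete across all variables?
sum(complete.cases(dm1000))
```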
What do graphs do with missing data?
ggplot(data = dm1000, aes(x = ldl)) +
  geom_histogram(bins = 20, fill = "slateblue", col = "cyan")
Note that 701/(149+144+701) = 701/994 = 0.705, approximately.
How does this compare to the expectation under a Normal model?
Coverage Probabilities in dm1000
| Variable | \(\bar{x}\) | \(s\) = SD | n | \(\bar{x} \pm s\) | \(\bar{x} \pm 2s\) | \(\bar{x} \pm 3s\) |
|----------|-------------|------------|------|-------|-------|-------|
| sbp      | 132.8       | 18.0       | 994  | 70.5% | 95.0% | 99.2% |
| dbp      | 74.5        | 12.4       | 994  | 67.5% | 95.3% | 99.7% |
| age      | 57.2        | 10.1       | 1000 | 64.8% | 96.6% | 100%  |
| ldl      | 100.7       | 37.1       | 822  | 69.8% | 95.6% | 99.1% |
| n_income | 35,178      | 18,776     | 972  | 75.0% | 95.3% | 98.5% |
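These coverage percentages can be computed directly; a minimal sketch for sbp, assuming dm1000 is loaded (the other rows follow the same pattern):

```r
x <- dm1000$sbp
xbar <- mean(x, na.rm = TRUE)
s <- sd(x, na.rm = TRUE)

## proportion of non-missing values within 1, 2 and 3 SDs of the mean
mean(abs(x - xbar) <= 1 * s, na.rm = TRUE)
mean(abs(x - xbar) <= 2 * s, na.rm = TRUE)
mean(abs(x - xbar) <= 3 * s, na.rm = TRUE)
```

Under a Normal model we would expect roughly 68%, 95% and 99.7%.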
Conclusions about utility of the Normal model?
Do these match the conclusions from the plots?
Normal Q-Q plots for dm1000
Hypothesis Testing for Normality?
Don’t. Graphical approaches are far better than tests…
shapiro.test(dm1000$sbp)
Shapiro-Wilk normality test
data: dm1000$sbp
W = 0.98276, p-value = 1.88e-09
The very small p value indicates that the test finds evidence against adopting a Normal model for these data.
Exciting, huh? Alas, not actually useful.
Other Hypothesis Tests of Normality
The nortest package, which I don’t even install as part of our 431 packages, includes many other possible tests of Normality for a batch of data, including:
nortest::ad.test() Anderson-Darling test
nortest::lillie.test() Lilliefors test
nortest::cvm.test() Cramer-von Mises test
nortest::sf.test() Shapiro-Francia test
nortest::pearson.test() Pearson chi-square test
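Each of these is called just like shapiro.test(); for example (a sketch, assuming the nortest package is installed and dm1000 is loaded):

```r
nortest::ad.test(dm1000$sbp)     ## Anderson-Darling
nortest::lillie.test(dm1000$sbp) ## Lilliefors (Kolmogorov-Smirnov)
```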
Our dm1000 data: Testing Normality
| Variable | S-W    | A-D   | Lill  | C-vM  | S-Fr   | Pear   |
|----------|--------|-------|-------|-------|--------|--------|
| sbp      | 2e-09  | 3e-08 | 6e-10 | 2e-06 | 9e-09  | <2e-16 |
| dbp      | 0.0001 | 0.030 | 0.033 | 0.078 | 0.0001 | 0.0002 |
n_income has a tiny p value for all tests, but so what?
So does age and so do sbp and ldl…
Simulated Normally distributed data
set.seed(431)
sim150 <- rnorm(n = 150, mean = 100, sd = 10)
sim1000 <- rnorm(n = 1000, mean = 100, sd = 10)
| Variable | \(skew_1\) | \(\bar{x} \pm s\) | \(\bar{x} \pm 2s\) | \(\bar{x} \pm 3s\) | Outliers? |
|----------|------------|-------|-------|-------|--------|
| sim150   | 0.113      | 67.3% | 97.3% | 99.3% | 1/150  |
| sim1000  | 0.024      | 67.7% | 95.9% | 99.9% | 4/1000 |
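Assuming \(skew_1\) here is the nonparametric skew, \((\bar{x} - \textrm{median})/s\), a sketch reproducing the sim150 row (the outlier count assumes the boxplot's 1.5 IQR fence rule):

```r
## nonparametric skew: (mean - median) / SD (hypothetical helper matching the table)
skew1 <- function(x) (mean(x) - median(x)) / sd(x)
skew1(sim150)

## coverage of mean +/- 1 SD
mean(abs(sim150 - mean(sim150)) <= sd(sim150))

## candidate outliers by the boxplot (1.5 * IQR) rule
length(boxplot.stats(sim150)$out)
```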
Normal Q-Q plots and Shapiro-Wilk?
Shapiro-Wilk: for sim150, p = 0.682, while for sim1000, p = 0.038
Why not test for Normality? (1)
There are multiple hypothesis testing schemes and each looks for one specific violation of a Normality assumption.
None can capture the wide range of issues our brains can envision, and none by itself is great at its job.
With any reasonable sample size, the test is so poor at detecting non-normality compared to our eyes that it finds problems we don't care about and ignores problems we do care about.
Why not test for Normality? (2)
And without a reasonable sample size, the test is essentially useless.
Whenever you can avoid hypothesis testing and instead actually plot the data, you should plot the data.
Sometimes you can’t plot (especially with really big data) but the test should be your very last resort.
Does a Normal Model fit well?
Do we have…
A histogram that is symmetric and bell-shaped?
A boxplot where the box is symmetric around the median, as are the whiskers, without severe outliers?
A normal Q-Q plot that essentially falls on a straight line?
If in doubt, maybe compare mean to median re: skew, and consider Empirical Rule to help make tough calls?
Big issue: why do you need to assume a Normal model?
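A sketch putting those three checks side by side for one variable, using the patchwork package loaded above (assuming dm1000 is available; sbp is just an example):

```r
p1 <- ggplot(dm1000, aes(x = sbp)) +
  geom_histogram(bins = 20)
p2 <- ggplot(dm1000, aes(x = "", y = sbp)) +
  geom_boxplot() + coord_flip() + labs(x = "")
p3 <- ggplot(dm1000, aes(sample = sbp)) +
  geom_qq() + geom_qq_line()

p1 + p2 + p3   ## patchwork places the three plots side by side
```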
Dealing with Categories
Using count to create a tibble of counts
dm1000 |> count(tobacco)
# A tibble: 4 × 2
tobacco n
<fct> <int>
1 Current 274
2 Never 343
3 Former 367
4 <NA> 16
dm1000 |> tabyl(tobacco) |>
  adorn_pct_formatting() |>
  adorn_totals() |>
  gt() |>
  gt_theme_nytimes() |>
  tab_header(title = "Table styled like the New York Times")
Table styled like the New York Times
| tobacco | n    | percent | valid_percent |
|---------|------|---------|---------------|
| Current | 274  | 27.4%   | 27.8%         |
| Former  | 367  | 36.7%   | 37.3%         |
| Never   | 343  | 34.3%   | 34.9%         |
| NA      | 16   | 1.6%    | -             |
| Total   | 1000 | -       | -             |
There’s also a gt_theme_espn() and several others.
Using geom_bar to show a distribution
ggplot(dm1000, aes(x = tobacco)) +
  geom_bar()
Augmenting the geom_bar result
tempdat <- dm1000 |> filter(complete.cases(tobacco))

ggplot(data = tempdat, aes(x = tobacco, fill = tobacco)) +
  geom_bar() +
  geom_text(aes(label = ..count..), stat = "count",
            vjust = 1.5, col = "white", size = 8) +
  scale_fill_viridis_d(option = "C", end = 0.8) +
  guides(fill = "none")
Augmenting the geom_bar result
Using count to create a tibble of counts
dm1000 |> count(statin, tobacco)
# A tibble: 8 × 3
statin tobacco n
<dbl> <fct> <int>
1 0 Current 67
2 0 Former 80
3 0 Never 92
4 0 <NA> 3
5 1 Current 207
6 1 Former 287
7 1 Never 251
8 1 <NA> 13
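To see this as a two-way table rather than a long count, a sketch using either tidyr's pivot_wider() or janitor's tabyl():

```r
## wide layout from the long count
dm1000 |> count(statin, tobacco) |>
  pivot_wider(names_from = tobacco, values_from = n)

## or directly, with janitor
dm1000 |> tabyl(statin, tobacco)
```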